Smart exploration in reinforcement learning using absolute temporal difference errors
Abstract
Exploration remains one of the central problems in reinforcement learning, especially for agents acting in safety-critical situations. We propose a new directed exploration method based on a notion of state controllability. Intuitively, if an agent wants to stay safe, it should seek out states where the effects of its actions are easier to predict; we call such states more controllable. Our main contribution is a new notion of controllability, computed directly from temporal-difference errors. Unlike existing approaches of this type, our method scales linearly with the number of state features and is directly applicable to function approximation. Our method converges to correct values in the policy-evaluation setting. We also demonstrate significantly faster learning when this exploration strategy is used in large control problems.
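As a rough illustration of the idea (a sketch, not the authors' algorithm), one way to derive a controllability signal from absolute TD errors under linear function approximation is to keep a per-feature running average of |δ|; the update below costs O(number of features) per step, consistent with the linear scaling claimed above. The names, the moving-average update, and the constants are all illustrative assumptions.

    import numpy as np

    n_features = 8
    alpha, beta, gamma = 0.1, 0.1, 0.99

    w = np.zeros(n_features)   # value-function weights
    c = np.zeros(n_features)   # running estimate of |TD error| per feature

    def td_step(phi_s, reward, phi_next):
        """One TD(0) update that also tracks the absolute TD error."""
        delta = reward + gamma * phi_next @ w - phi_s @ w
        w[:] += alpha * delta * phi_s            # standard TD(0) weight update
        c[:] += beta * (abs(delta) - c) * phi_s  # moving average of |delta| on active features
        return delta

    def controllability(phi_s):
        # Lower predicted |TD error| = effects easier to predict = more controllable.
        return -(phi_s @ c)

An agent could then bias exploration toward actions whose successor states score higher under controllability(), for example as a bonus added to action values.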
Similar papers
Consistent exploration improves convergence of reinforcement learning on POMDPs
This paper sets out the concept of consistent exploration of observation-action pairs. We present a new temporal difference algorithm, CEQ(λ), based on this concept and demonstrate using a randomly generated set of partially observable Markov decision processes (POMDPs) that it outperforms SARSA(λ). This result should generalise to any POMDP where satisficing policies which map observations to ...
Reducing Commitment to Tasks with Off-Policy Hierarchical Reinforcement Learning
In experimenting with off-policy temporal difference (TD) methods in hierarchical reinforcement learning (HRL) systems, we have observed unwanted on-policy learning under reproducible conditions. Here we present modifications to several TD methods that prevent unintentional on-policy learning from occurring. These modifications create a tension between exploration and learning. Traditional TD m...
Resource Constrained Exploration in Reinforcement Learning
This paper examines temporal difference reinforcement learning (RL) with adaptive and directed exploration for resource-limited missions. The scenario considered is an energy-limited agent that must explore an unknown region to find new energy sources. The presented algorithm uses a Gaussian Process (GP) regression model to estimate the value function in an RL framework. However, to avoid ...
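A minimal sketch of the GP-regression ingredient described above follows; the kernel choice, hyperparameters, and the use of visited states as training inputs are assumptions, not the paper's implementation.

    import numpy as np

    def rbf_kernel(A, B, length=1.0, var=1.0):
        """Squared-exponential kernel between two sets of state vectors."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return var * np.exp(-0.5 * d2 / length**2)

    def gp_value_posterior(X_train, v_train, X_query, noise=1e-2):
        """Posterior mean and variance of the value function at query states."""
        K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
        K_inv = np.linalg.inv(K)
        K_s = rbf_kernel(X_query, X_train)
        mean = K_s @ K_inv @ v_train
        var = np.diag(rbf_kernel(X_query, X_query)) - np.einsum(
            "ij,jk,ik->i", K_s, K_inv, K_s)
        return mean, var

In a sketch like this, high posterior variance marks states the agent is uncertain about, giving a natural directed-exploration signal subject to the energy budget.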
Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences
This paper presents “Value-Difference Based Exploration” (VDBE), a method for balancing the exploration/exploitation dilemma inherent to reinforcement learning. The proposed method adapts the exploration parameter of ε-greedy as a function of the temporal-difference error observed from value-function backups, which is taken as a measure of the agent’s uncertainty about the environment. VDB...
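The mechanism can be sketched in a few lines: the per-state ε is pulled toward a Boltzmann-like function of the most recent TD error, so states with surprising value backups are explored more. The constants below are illustrative, and the exact functional form in the original formulation may differ.

    import math

    def vdbe_update(eps_s, td_error, sigma=1.0, delta=0.5):
        """Return the updated per-state exploration rate eps(s)."""
        x = math.exp(-abs(td_error) / sigma)  # sigma controls sensitivity to TD errors
        f = (1.0 - x) / (1.0 + x)             # in [0, 1); large |TD error| pushes f toward 1
        return delta * f + (1.0 - delta) * eps_s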
Using Reinforcement Learning to Make Smart Energy Storage Source in Microgrid
The use of renewable energy in power generation, together with sudden changes in load and faults in power transmission lines, may cause voltage drops and challenge the reliability of the system. One way to compensate for the variable nature of renewable energies in the short term, without disconnecting loads or turning on other plants, is to store renewable energy. The use of ener...